Brief Introduction

This homework will analyze the Disease Prediction dataset which records disease related observations. The purpose of this homework is to find essential features and approproate modles to predict that if the patient is getting that disease or not. To get this estimated result, we should clean the dataset, find important features, and construct required models through using a series of Classification Models, such as ANN, Logistic Regression, KNN, XGBoost, SVM, Random Forest, and Naive Bayes. Then, we need to use those models to fit the test dataset and get estimated results. Finally, we will use several model performance evaluation techniques to assess each model and select the most appropriate model for this dataset.

Data Importing and Munging

Import dataset, convert data type, remove features with no or low variance, remove duplicates, checking and replacing null values, finding and replacing outliers.
Before analyzing the dataset, one should library all the necessary packages.

library(caret) 
library(car)
library(corrplot)
library(dplyr)
library(e1071)
library(factoextra)
library(ggplot2)
library(gplots)
library(ggpubr)
library(glmnet)
library(knitr)
library(keras)
library(klaR)
library(lime)
library(Matrix)
library(naivebayes)
library(naniar)
library(pROC)
library(plyr)
library(rattle)
library(rpart)
library(recipes)
library(RANN)
library(RColorBrewer)
library(tidyr)
library(tidyquant)
library(tidyverse)
library(xgboost)
library(yardstick)

Importing the Dataset.

my.dir <- getwd()
train <- read.csv(paste0(my.dir, "/","Disease Prediction Training.csv"), header = TRUE, stringsAsFactors = FALSE)
test <- read.csv(paste0(my.dir, "/","Disease Prediction Testing.csv"), header = TRUE, stringsAsFactors = FALSE)

Checking data structure
The weather dataset consists of 12 columns. 9 out of these 12 columns contain numeric data while the rest of the columns contain categorical data. Since the test dataset are similar with train dataset, it is unnecessary to display again.

dim(train)
## [1] 49000    12
str(train)
## 'data.frame':    49000 obs. of  12 variables:
##  $ Age                : int  59 64 41 50 39 54 48 51 42 41 ...
##  $ Gender             : chr  "female" "female" "female" "male" ...
##  $ Height             : int  167 150 166 172 162 163 159 171 161 159 ...
##  $ Weight             : num  88 71 83 110 61 61 89 71 72 43 ...
##  $ High.Blood.Pressure: int  130 140 100 130 110 120 150 110 150 90 ...
##  $ Low.Blood.Pressure : int  68 100 70 80 80 80 90 70 90 60 ...
##  $ Cholesterol        : chr  "normal" "normal" "normal" "normal" ...
##  $ Glucose            : chr  "normal" "normal" "normal" "normal" ...
##  $ Smoke              : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ Alcohol            : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Exercise           : int  1 0 1 1 1 1 1 1 1 1 ...
##  $ Disease            : int  0 1 0 0 0 0 1 0 1 0 ...
summary(train)
##       Age           Gender              Height          Weight      
##  Min.   :29.00   Length:49000       Min.   : 55.0   Min.   : 10.00  
##  1st Qu.:48.00   Class :character   1st Qu.:159.0   1st Qu.: 65.00  
##  Median :53.00   Mode  :character   Median :165.0   Median : 72.00  
##  Mean   :52.85                      Mean   :164.4   Mean   : 74.19  
##  3rd Qu.:58.00                      3rd Qu.:170.0   3rd Qu.: 82.00  
##  Max.   :64.00                      Max.   :207.0   Max.   :200.00  
##  High.Blood.Pressure Low.Blood.Pressure Cholesterol          Glucose         
##  Min.   : -150.0     Min.   :    0.00   Length:49000       Length:49000      
##  1st Qu.:  120.0     1st Qu.:   80.00   Class :character   Class :character  
##  Median :  120.0     Median :   80.00   Mode  :character   Mode  :character  
##  Mean   :  128.7     Mean   :   96.92                                        
##  3rd Qu.:  140.0     3rd Qu.:   90.00                                        
##  Max.   :14020.0     Max.   :11000.00                                        
##      Smoke            Alcohol           Exercise         Disease   
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0  
##  Median :0.00000   Median :0.00000   Median :1.0000   Median :0.0  
##  Mean   :0.08827   Mean   :0.05424   Mean   :0.8032   Mean   :0.5  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0

Missing Values Management

Checking the columns with missing values and empty values. We can find that none of these columns has missing value. Here, neither the train or test dataset has missing value.

vis_miss(train)

vis_miss(test)


Empty Values Management
It is possible that those categorcial type columns contain empty value. To find out that, we can elaborate all the factors conatined inside.

categorical_columns <- c('Gender', 'Cholesterol', 'Glucose')
for (i in categorical_columns){
  print(unique(train[i]))
}
##   Gender
## 1 female
## 4   male
##    Cholesterol
## 1       normal
## 5         high
## 20    too high
##     Glucose
## 1    normal
## 5      high
## 18 too high
categorical_columns <- c('Gender', 'Cholesterol', 'Glucose')
for (i in categorical_columns){
  print(unique(test[i]))
}
##   Gender
## 1 female
## 3   male
##    Cholesterol
## 1         high
## 2       normal
## 42    too high
##     Glucose
## 1    normal
## 3      high
## 73 too high

Dealing with Duplicates

Checking any possible duplicate data for both weather and test dataset. There are 3061 duplicates inside the dataset. We can go through the top 10 duplicates. Then, all those duplicates should be eliminated.

duplicated <- train[duplicated(train),] %>% arrange(Age, Gender, Height, Weight)
dim(duplicated)
## [1] 1752   12
head(duplicated,10)
##    Age Gender Height Weight High.Blood.Pressure Low.Blood.Pressure Cholesterol
## 1   39 female    155     55                 110                 70      normal
## 2   39 female    156     69                 120                 80      normal
## 3   39 female    157     56                 110                 70      normal
## 4   39 female    158     58                 110                 70      normal
## 5   39 female    158     64                 120                 80      normal
## 6   39 female    158     64                 120                 80      normal
## 7   39 female    158     64                 120                 80      normal
## 8   39 female    160     58                 120                 80      normal
## 9   39 female    160     59                 110                 70      normal
## 10  39 female    161     58                 110                 70      normal
##    Glucose Smoke Alcohol Exercise Disease
## 1   normal     0       0        1       0
## 2   normal     0       0        1       0
## 3   normal     0       0        1       0
## 4   normal     0       0        1       0
## 5   normal     0       0        1       0
## 6   normal     0       0        1       0
## 7   normal     0       0        1       0
## 8   normal     0       0        1       0
## 9   normal     0       0        1       0
## 10  normal     0       0        1       0
train <- unique(train)

We have to manage the test dataset in the same way.

test <- unique(test)

Data Type Conversion

change the original data types.

num_var <- sapply(train, is.integer)
train[, num_var] <- lapply(train[, num_var], as.numeric)
char_var <- sapply(train, is.character)
train[, char_var] <- lapply(train[, char_var], as.factor)

Managing the test dataset in the same way.

num_var <- sapply(test, is.integer)
test[, num_var] <- lapply(test[, num_var], as.numeric)
char_var <- sapply(test, is.character)
test[, char_var] <- lapply(test[, char_var], as.factor)

0utliers Management

To find potential outliers, one can use boxplots to visualize the data. For this graph, one can conclude the columns with outliers:
* Age
* Heigth
* Weight
* High.Blood.Pressure
* Low.Blood.Pressure

1. Even though there are several columns with outliers, we have to manage those outliers separately. Some of them will not be managed since those outliers are plausible to exist in real world. For example, the minimum of “Age” (which is 29) has been recognized as outlier. However, this number is highly possible to exist.
2. Some of them will be eliminated, replaced, or truncated since it contain very strange value. Take the blood pressure as an example. It is a common sense that blood pressure could not be negative or higher than 300 mm Hg. After googling, the American Heart Association publiced that there are two numbers in a blood pressure reading, which are the upper (systolic) and lower (diastolic) numbers. Normally, lower blood pressure ranges between 60 and 140 while upper blood pressure ranges between 90 and 250. Besides, it is also abnormal to find that the value of Lower Blood Pressure is higher than that of High Blood Pressure.
3. What’s more, abnormalies also could be found in “Heigth” and “Weigth” columns. For example, a male whose heigth is 178 only weighted 11kg.

train %>% keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(y = value, fill = "orange")) +
  facet_wrap(~key, scales = "free") +
  geom_boxplot() +
  labs(title = "Boxplots of Numeric Columns") +
  theme_minimal()



To manage those unrealistic outliers, we can check their quantile values. The 2.5% and 97.5% quantitles for the Low.Blood.Pressure are 60 and 100 while those of values for High.Blood.Pressure are 100 and 170.

quantile(train$Low.Blood.Pressure, c(0.025,0.975))
##  2.5% 97.5% 
##    60   106
quantile(train$High.Blood.Pressure, c(0.025,0.975))
##  2.5% 97.5% 
##   100   170


Combing the quantile results above and information we obtained from the Internet, we decide to set 60 and 140 as the boundary of Low.Blood.Pressure while to set 90 and 250 as the boundary for High.Blood.Pressure could be 90 and 250.

train <- train %>% filter(Low.Blood.Pressure >= 60 & Low.Blood.Pressure <= 140)
train <- train %>% filter(High.Blood.Pressure >= 90 & High.Blood.Pressure <= 250)
train <- train %>% filter(Low.Blood.Pressure < High.Blood.Pressure)


For outliers in “Heigth” and “Weigth”, we can follow the same processes mentioned above. Then, we can decide that the boundary for heigth is 51 and 108 while the boundary for Height is 150 and 180.

quantile(train$Weight, c(0.025,0.975))
##  2.5% 97.5% 
##    51   108
quantile(train$Height, c(0.025,0.975))
##  2.5% 97.5% 
##   150   180
train <- train %>% filter(Weight >= 51 & Weight <= 108)
train <- train %>% filter(Height >= 150 & Height <= 180)


To better discover the relationship between “Heigth” and “Weigth”, we can introduce a new column called “BMI” to the dataset.

train$BMI <- as.numeric(train$Weight/((train$Height/100)^2))


Similarly, we have to manage the test dataset in the same way.

test <- test %>% filter(Low.Blood.Pressure >= 60 & Low.Blood.Pressure <= 140)
test <- test %>% filter(High.Blood.Pressure >= 90 & High.Blood.Pressure <= 250)
test <- test %>% filter(Low.Blood.Pressure < High.Blood.Pressure)
test <- test %>% filter(Height >= 150 & Height <= 180)
test <- test %>% filter(Weight >= 51 & Weight <= 107)
test$BMI <- as.numeric(test$Weight/((test$Height/100)^2))

Checking the Dataset

Now, we can go through the summary of the train dataset after all those manipulation. Here, we introduce a new column called “Disease_Type” which is as same as the Disease column.

train$Disease_Type <- as.factor(ifelse(train$Disease == 1, "Contracted", "Healthy"))
dim(train)
## [1] 42397    14
str(train)
## 'data.frame':    42397 obs. of  14 variables:
##  $ Age                : num  59 64 41 39 54 48 51 42 56 64 ...
##  $ Gender             : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Height             : num  167 150 166 162 163 159 171 161 170 165 ...
##  $ Weight             : num  88 71 83 61 61 89 71 72 83 75 ...
##  $ BMI                : num  31.6 31.6 30.1 23.2 23 ...
##  $ High.Blood.Pressure: num  130 140 100 110 120 150 110 150 140 130 ...
##  $ Low.Blood.Pressure : num  68 100 70 80 80 90 70 90 90 60 ...
##  $ Cholesterol        : Factor w/ 3 levels "high","normal",..: 2 2 2 1 2 1 2 1 2 2 ...
##  $ Glucose            : Factor w/ 3 levels "high","normal",..: 2 2 2 1 2 1 2 1 2 2 ...
##  $ Smoke              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alcohol            : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Exercise           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Disease            : num  0 1 0 0 0 1 0 1 1 1 ...
##  $ Disease_Type       : Factor w/ 2 levels "Contracted","Healthy": 2 1 2 2 2 1 2 1 1 1 ...

Exploratory Data Analysis and Data Visualization

First of all, demonstarting the distribution of all the numeric columns by using histogram. Excluding those binary attributes, some of them are right skewed while some of them do not follow the pattern of nornal distribution.

train %>% keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  facet_wrap(~key, scales = "free") +
  geom_histogram(bins = 15, fill = "lightblue", color = "white") +
  labs(title = "Show Distribution of Each Numeric Column")


Before visualizing all the data, we might need to visualize the target variable at first. It is good to see that the target variable is balenced.

barplot(table(train$Disease_Type), col = c("lightblue","pink"), main = "Barplot of Disease", ylab = "Count")


Stacked Barplots for Gender, Cholesterol, and Glucose
Then, using barplots to display the frequency of categorical variables, such as Gender, Cholesterol, and Glucose.

* Gender: The number of male who contracts the disease is similar with that of male who does not. Likewise, The number of female who contracts the disease is slightly higher than that of female who does not. However, it is easy to find out that the number of female who gets the disease (which is 14,107) is significantly higher than that of male (which is 7,782). Similar pattern could be witnessed from the group of people who does not get the disease. Then, one assumption could be made is that female has a higher chance to get contracted. However, given that the number of female records in the dataset is higher than that of male, the assumption might not be confirmed.

* Cholesterol: The number of people who contracts the disease is similar with that of people who does not. For those who got this disease, the proportion of people who has a normal Cholesterol value is the highest (which is 14,338). For those who has “High” or “Too.High” in Cholesterol value, the number of “Too.High” is higher than that of people who has a high Cholesterol value in the contracted group. On the contrary, the pettern is quit the opposite for the healthy group. Hence, we can make an assumption that people who has an rather high Cholesterol value tends to contract that disease.

* Glucose: For those who get this disease, the proportion of people who has a normal Glucose value is the highest (which is 17,859). Then the number of people who has a rather high Glucose value is slightly higher than that of people who has a high Cholesterol value. Similarly, the number of “Too.High” is higher than that of “High” in contracted group. In the healthy group, the conclusion is different. Then, it seems that a higher value in Glucose could also contributes to the chance of getting the disease.

gender <- as.data.frame(table(train$Gender,train$Disease_Type))
colnames(gender) <- c("Gender","Disease","Count")
Gender.plot <- ggplot(gender, aes(x = Disease, y = Count, fill = Gender, label = Count)) +
  geom_bar(stat = "identity") +
  geom_text(size = 5, position = position_stack(vjust = 1.1)) +
  theme_minimal() +
  scale_color_brewer(palette = "Paired")

Cholesterol <- as.data.frame(table(train$Cholesterol, train$Disease_Type))
colnames(Cholesterol) <- c("Cholesterol","Disease","Count")
Cholesterol.plot <- ggplot(Cholesterol, aes(x = Disease, y = Count, fill = Cholesterol, label = Count)) +
  geom_bar(stat = "identity") +
  geom_text(size = 5, position = position_stack(vjust = 1)) +
  theme_minimal() +
  scale_color_brewer(palette = "Paired")

Glucose <- as.data.frame(table(train$Glucose, train$Disease_Type))
colnames(Glucose) <- c("Glucose","Disease","Count")
Glucose.plot <- ggplot(Glucose, aes(x = Disease, y = Count, fill = Glucose, label = Count)) +
  geom_bar(stat = "identity") +
  geom_text(size = 5, position = position_stack(vjust = 1)) +
  theme_minimal() +
  scale_color_brewer(palette = "Paired")

ggarrange(Gender.plot, Cholesterol.plot, Glucose.plot, labels = c("Gender", "Cholesterol", "Glucose"), ncol = 2, nrow = 2)



Next, we might want to further explore the data distribution for contracted group in terms of continuous data type column, such as Age, BMI, Lower Blood Pressure, and Higher Blood Pressure. Here, Age and BMI will be displayed together while the two type of Blood Pressure will be displayed together.

Barplots for Age and BMI
* Age: For the healthy group, the distribution does not have certain fix pattern. For the disease contracted group, it seems that the number of people increases with the increases of age. Hence, we can make an assumption that elder people tends to get this disease.

* BMI: To better visualize the plot, we can round the BMI value to whole digit number. For both the healthy and contracted groups, the distributions seem quite similar. However, the BMI value of healthy people is generally higher than that of contracted patient.

Age <- as.data.frame(table(train$Age, train$Disease_Type))
colnames(Age) <- c("Age","Disease","Count")
Age.plot <- ggplot(Age, aes(x = Age, y = Count, fill = Disease)) +
  geom_col() + 
  ylim(c(0,1500)) +
  geom_text(aes(label = Count), vjust = -0.5) +
  facet_grid(Disease ~ ., scales = "free") +
  theme_minimal()

bmi <- as.data.frame(table(round(train$BMI,0), train$Disease_Type))
colnames(bmi) <- c("BMI","Disease","Count")
BMI.plot <- ggplot(bmi, aes(x = BMI, y = Count, fill = Disease)) +
  geom_col() + 
  facet_grid(Disease ~ ., scales = "free") +
  ylim(c(0,2800)) +
  geom_text(aes(label = Count), vjust = -0.5) +
  theme_minimal()

ggarrange(Age.plot, BMI.plot, labels = c("Age", "BMI"), ncol = 1, nrow = 2) 


Barplots for Blood Pressure
* Low Blood Pressure: To better visualize the plot, we can use filter to eliminate records with count that is lower than 10. Then, compared with healthy group, people with higher lower blood pressure seems to have a higher possibility to contract the disease. Namely, people with the disease has higher cholesterol level.

* High Blood Pressure: To better visualize the plot, we can use filter to eliminate records with count that is lower than 10. Comparing with the result obtained from the Low.Blood.Pressure barplot, similar conclusion could be made. The conclusion is obvious since the number of people who gets disease is definitly higher than that of people who does not.

Low.Blood <- as.data.frame(table(train$Low.Blood.Pressure, train$Disease_Type))
colnames(Low.Blood) <- c("Low.Blood.Pressure","Disease","Count")
Low.Blood <- Low.Blood %>% filter(Count >= 10)
Low.Blood.plot <- ggplot(Low.Blood, aes(x = Low.Blood.Pressure, y = Count, color = Disease, fill = Disease)) +
  geom_col() +
  facet_grid(Disease ~ ., scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,13000)) +
  geom_text(aes(label = Count), vjust = -0.5) +
  theme_minimal()

High.Blood <- as.data.frame(table(train$High.Blood.Pressure, train$Disease_Type))
colnames(High.Blood) <- c("High.Blood.Pressure","Disease","Count")
High.Blood <- High.Blood %>% filter(Count >= 10)
High.Blood.plot <- ggplot(High.Blood, aes(x = High.Blood.Pressure, y = Count, color = Disease, fill = Disease)) +
  geom_col() +
  facet_grid(Disease ~ ., scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,13000)) +
  geom_text(aes(label = Count), vjust = -0.5) +
  theme_minimal()

ggarrange(Low.Blood.plot, High.Blood.plot, labels = c("Low.Blood", "High.Blood"), ncol = 1, nrow = 2) 


Barplot for Alcohol
* Alcohol: Summarizing this barplot, we can find that, for people who drinks, the number of contracted patient is higher than that of healthy patient. On the contrary, for those who do not drink alcohol, the pattern is quit the opposite even though the gap is quit small. If we taking gender into consideration, we can find that the number of contracted female who drinks a lot is higher than that of contracted male who also drinks a lot. Then, drinking might have a minor influence the disease contract possibility.

Alcohol <- as.data.frame(table(train$Alcohol, train$Gender, train$Disease_Type))
colnames(Alcohol) <- c("Alcohol","Gender","Disease","Count")
Alcohol$Alcohol <- as.factor(ifelse(Alcohol$Alcohol == 0, "Drink", "Not Drink"))
ggplot(Alcohol, aes(x = Alcohol, y = Count, color = Disease, fill = Disease)) +
  geom_col(stat="identity", position="dodge") +
  facet_grid(Gender ~ ., scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme_minimal()

str(train)
## 'data.frame':    42397 obs. of  14 variables:
##  $ Age                : num  59 64 41 39 54 48 51 42 56 64 ...
##  $ Gender             : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 2 1 ...
##  $ Height             : num  167 150 166 162 163 159 171 161 170 165 ...
##  $ Weight             : num  88 71 83 61 61 89 71 72 83 75 ...
##  $ BMI                : num  31.6 31.6 30.1 23.2 23 ...
##  $ High.Blood.Pressure: num  130 140 100 110 120 150 110 150 140 130 ...
##  $ Low.Blood.Pressure : num  68 100 70 80 80 90 70 90 90 60 ...
##  $ Cholesterol        : Factor w/ 3 levels "high","normal",..: 2 2 2 1 2 1 2 1 2 2 ...
##  $ Glucose            : Factor w/ 3 levels "high","normal",..: 2 2 2 1 2 1 2 1 2 2 ...
##  $ Smoke              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alcohol            : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Exercise           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Disease            : num  0 1 0 0 0 1 0 1 1 1 ...
##  $ Disease_Type       : Factor w/ 2 levels "Contracted","Healthy": 2 1 2 2 2 1 2 1 1 1 ...

PCA Analysis & Correlation Plot

We can also use PCA techniques and draw correlation plot to discover the relationships among these columns.
Before continuing that, we might need to convert all the categorical data to numeric data via dummy function.

dummies <- train[,sapply(train, is.numeric)]
dmy <- caret::dummyVars("~Gender", data = train, fullRank = T)
dummies <- cbind(dummies, data.frame(predict(dmy, newdata = train)))
dmy <- caret::dummyVars("~Cholesterol", data = train, fullRank = F)
dummies <- cbind(dummies, data.frame(predict(dmy, newdata = train)))
dmy <- caret::dummyVars("~Glucose", data = train, fullRank = F)
dummies <- cbind(dummies, data.frame(predict(dmy, newdata = train)))

Go through the structure of dummies dataset.

str(dummies)
## 'data.frame':    42397 obs. of  17 variables:
##  $ Age                 : num  59 64 41 39 54 48 51 42 56 64 ...
##  $ Gender.male         : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Height              : num  167 150 166 162 163 159 171 161 170 165 ...
##  $ Weight              : num  88 71 83 61 61 89 71 72 83 75 ...
##  $ BMI                 : num  31.6 31.6 30.1 23.2 23 ...
##  $ High.Blood.Pressure : num  130 140 100 110 120 150 110 150 140 130 ...
##  $ Low.Blood.Pressure  : num  68 100 70 80 80 90 70 90 90 60 ...
##  $ Smoke               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alcohol             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Exercise            : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Cholesterol.high    : num  0 0 0 1 0 1 0 1 0 0 ...
##  $ Cholesterol.normal  : num  1 1 1 0 1 0 1 0 1 1 ...
##  $ Cholesterol.too.high: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Glucose.high        : num  0 0 0 1 0 1 0 1 0 0 ...
##  $ Glucose.normal      : num  1 1 1 0 1 0 1 0 1 1 ...
##  $ Glucose.too.high    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Disease             : num  0 1 0 0 0 1 0 1 1 1 ...

Same process for test dataset

dummies_test <- test[,sapply(test, is.numeric)]
dmy_test <- caret::dummyVars("~Gender", data = test, fullRank = T)
dummies_test <- cbind(dummies_test, data.frame(predict(dmy_test, newdata = test)))
dmy_test <- caret::dummyVars("~Cholesterol", data = test, fullRank = F)
dummies_test <- cbind(dummies_test, data.frame(predict(dmy_test, newdata = test)))
dmy_test <- caret::dummyVars("~Glucose", data = test, fullRank = F)
dummies_test <- cbind(dummies_test, data.frame(predict(dmy_test, newdata = test)))


PCA Analysis
We can use PCA analysis to visualize how this data correlate with each other and in which degree this data stays significantly to the dataset. This is easy for now since we are using the whole dataset, but will require some modifications for model fitting with a train and test set.

# pca analysis
vis_pca <- function(df){
  res.pca <- prcomp(df, scale = TRUE)
  fviz_pca_var(res.pca,
               col.var = "contrib", # Color by contributions to the PC
               gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
               repel = TRUE     # Avoid text overlapping
  )
}
vis_pca(dummies[,-c(17)])


The PCA plot is rather messy. We need to drop columns that have less contribution to the dataset or columns that have similar direction. Then we can get features (or columns) that are significant to the dataset.

dummies_pca <-dummies[,c(1,5,6,7,8,12,13)]
vis_pca(dummies_pca)


Correlation Plot
Then, we can generate a correlation plot to further investigate the relationships amongh those variables.

After analyzing the plot, there are some obvious correlations here: 1. Weight and BMI has strong positive correlation since BMI derives from this column.
2. High.Blood.Pressure (“Normal”) and Low.Blood.Pressure (“Normal”) are strongly positively correlated with the other two factors in their variable (“High” and “Too.High”).
3. We can summarise that Age, Blood pressure (both), and Cholesterol (“High” and “Too.High”) have a positive relationship with the disease.
4. Weight, BMI, and Glucose are slightly negatively correlated with the disease.
5. Among those variables, Blood pressure have a strong relationshio among those features.

Even though the multicollinearity will theoratically impact the performance of regression models, we will still going to use all of the dummy variable in order to find out the most important feature.

cor_matrix <- cor(dummies[complete.cases(dummies), sapply(dummies, is.numeric)], method = "pearson")
corrplot(cor_matrix, type = "upper")

Models Construction

Now it is time to establish models. In this case, we will using several Classification Models to fit the dataset since this problem is a classification and regression problem. Each of the model has its own merits and disadvantages. Hence, we need to use several performance evaluation matrix or techniques to assess model performance. Those models are:
* Logistic Regression
* Deep Learning
* Naive Bayes Classifier
* KNN or k-Nearest Neighbors
* Random Forest
* Gradient Boosting Machine
* Linear Support Vector Machines
* SVM with RBF Kernal

Data Sampling

To evaluate the model performance, we need to splite the tagged dataset (which is the train dataset) into two. The sample_train subset will be used for model training and tunning while the sample_test will be used to test predict ability by calculating Accuracy, Specificity, Recall, and ROC value. Besides, since the original train dataset contains at least 40,000 rows of records, we need to sample a subset (which contains 18,000 records) from it in order to reduce the cost of time.
Besides, according to the PCA result, we could remove Gender, Height, Glucose.high, and Cholesterol.high. Removing Glucose.high, and Cholesterol.high is that Glucose.high and Cholesterol.high are positively correlated with Glucose.too.high, and Cholesterol.too.high, giveing that we might need to conduct Regression model, we could eliminate the those two columns to solve multicollinearity issue. Besides, accoring to one-hotspot requirement, you have to eliminate one and keep that as a base line.

set.seed(2000)
dummies$Disease.Type <- as.factor(train$Disease)
dummies_sample <- sample_n(dummies[,-c(2,3,11,14,17)], 18000)
# Same process for Test dataset
New_Test <- dummies_test[,-c(2,3,11,14,17)]

check the structure again.

str(dummies_sample)
## 'data.frame':    18000 obs. of  13 variables:
##  $ Age                 : num  62 39 44 59 59 48 55 46 58 42 ...
##  $ Weight              : num  80 103 55 74 85 71 99 56 88 72 ...
##  $ BMI                 : num  29.4 37.8 21.5 31.2 31.2 ...
##  $ High.Blood.Pressure : num  130 120 90 110 120 110 130 120 160 120 ...
##  $ Low.Blood.Pressure  : num  90 80 60 70 80 80 80 80 100 75 ...
##  $ Smoke               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Alcohol             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Exercise            : num  0 1 1 1 1 1 1 1 1 0 ...
##  $ Cholesterol.normal  : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Cholesterol.too.high: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Glucose.normal      : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ Glucose.too.high    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Disease.Type        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...

Since some models are sensitive towards irrelevant attributes & outliers and noisy data, we need to use the standardized and centuralized dataset.
It is important to do Data Pre-Processing within the sample used to evaluate each model, to ensure that the results account for all the variability in the test. If the data set say normalized or standardized before the tuning process, it would have access to additional knowledge (bias) and not give an as accurate estimate of performance on unseen data.

pre_process <- preProcess(dummies_sample, method = c("scale", "center"))
dummies_scale <- predict(pre_process, newdata = dummies_sample)

# Same process for Test Dataset
pre_process_test <- preProcess(New_Test, method = c("scale", "center"))
scale_test <- predict(pre_process_test, newdata = New_Test)

Next, randomly sample the subset into two. The sample_train contains 70% of the original subset data while the sample_test contains the rest of them.

train_index <- createDataPartition(dummies_scale$Disease.Type, p = 0.7, list = FALSE)
sample_train <- dummies_scale[train_index,]
sample_test <- dummies_scale[-train_index,]

Logistic Regression

Logistic regression is one of the most classic and powerful analysis tool in data analysis. However, it still requires severl pre-requisites, or the so-called assumptions, to be satiated, otherwise the result will be tortured. Hence, people introduced a penalty system to regression which are Lasso- and Ridge- logistic regression. In this project, we will, at first, conduct a normal logistic regression and then, using caret package, conduct a lasso and ridge logistic regression.

Logistic Regression by Using Step Function
We can use the setp function to conduct setpwise logistic regression. This model will be viwed as the baseline model of logistic regression experiments. Here we can see that the final optimal result is listed below. If we look at the z-value, it is clear that Weight and Alchohol are not significant since their values are higher than 0.05. In this case, we should have eliminate these two. However, considering that we will conduct lasso and ridge logistic regression later, we could keep them and see what will happen. Now let’s look at the deviance residuals, we know that this dataset is slightly right skewed.

logistic.all <- stats::step(glm(Disease.Type~.,data=sample_test,family = binomial()))

Then, we can use summary to check the best model. In this final model, Age, BMI, High.Blood.Pressure, Low.Blood.Pressure, Smoke, Alcohol, Exercise, Cholesterol.normal, Cholesterol.too.high, and Glucose.too.high have been selected.All of those elements are significant in p-value.

summary(logistic.all)
## 
## Call:
## glm(formula = Disease.Type ~ Age + BMI + High.Blood.Pressure + 
##     Low.Blood.Pressure + Smoke + Alcohol + Exercise + Cholesterol.normal + 
##     Cholesterol.too.high + Glucose.too.high, family = binomial(), 
##     data = sample_test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4769  -0.9271   0.1273   0.9391   2.4718  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           0.05761    0.03155   1.826 0.067810 .  
## Age                   0.33491    0.03241  10.332  < 2e-16 ***
## BMI                   0.10034    0.03268   3.071 0.002136 ** 
## High.Blood.Pressure   0.98220    0.05507  17.834  < 2e-16 ***
## Low.Blood.Pressure    0.11593    0.04858   2.387 0.017005 *  
## Smoke                -0.06881    0.03450  -1.995 0.046082 *  
## Alcohol              -0.11685    0.03416  -3.421 0.000624 ***
## Exercise             -0.10614    0.03063  -3.465 0.000531 ***
## Cholesterol.normal   -0.17366    0.04007  -4.334 1.46e-05 ***
## Cholesterol.too.high  0.18489    0.04581   4.036 5.44e-05 ***
## Glucose.too.high     -0.08863    0.03591  -2.468 0.013593 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7484.5  on 5398  degrees of freedom
## Residual deviance: 6071.1  on 5388  degrees of freedom
## AIC: 6093.1
## 
## Number of Fisher Scoring iterations: 4


Then, we can use anova() function to test the authenticity of the final model. It’s good to see that all those variables are significant in chi-test.

varImp(logistic.all)
##                        Overall
## Age                  10.332372
## BMI                   3.070592
## High.Blood.Pressure  17.834233
## Low.Blood.Pressure    2.386602
## Smoke                 1.994643
## Alcohol               3.420752
## Exercise              3.464845
## Cholesterol.normal    4.334189
## Cholesterol.too.high  4.036002
## Glucose.too.high      2.467852
anova(logistic.all, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Disease.Type
## 
## Terms added sequentially (first to last)
## 
## 
##                      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                  5398     7484.5              
## Age                   1   298.03      5397     7186.5 < 2.2e-16 ***
## BMI                   1   124.60      5396     7061.9 < 2.2e-16 ***
## High.Blood.Pressure   1   869.01      5395     6192.9 < 2.2e-16 ***
## Low.Blood.Pressure    1     6.87      5394     6186.0 0.0087871 ** 
## Smoke                 1    11.36      5393     6174.7 0.0007502 ***
## Alcohol               1    11.20      5392     6163.5 0.0008198 ***
## Exercise              1    11.23      5391     6152.2 0.0008057 ***
## Cholesterol.normal    1    63.71      5390     6088.5  1.44e-15 ***
## Cholesterol.too.high  1    11.34      5389     6077.2 0.0007577 ***
## Glucose.too.high      1     6.11      5388     6071.1 0.0134067 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Lasso logistic regression & Ridge Logistic Regression
LASSO is a penalized regression approach that estimates the regression coefficients by maximizing the log-likelihood function (or the sum of squared residuals) with the constraint that the sum of the absolute values of the regression coefficients. Lasso has the ability to automatically deletes unnecessary covariates.
First of all, we need to assign the target variables of train dataset to y_train while assign the rest variables to x_train by using as.matrix function. Then, we can use cv.glmnet function to get the best lambda (which has the lowest error rate) of lasso and ridge logistic regression respectively. Now we get the range of lambda for this sample_train dataset.

y_train <- as.factor(sample_train[,13])
x_train <- as.matrix(sample_train[,1:12])

cv_lasso <- cv.glmnet(x_train, y_train, type.measure="class", alpha = 1, nfolds=10, family="binomial")
cv_lasso$lambda.min
## [1] 0.0009823539
cv_ridge <- cv.glmnet(x_train, y_train, type.measure="class", alpha = 0, nfolds=10, family="binomial")
cv_ridge$lambda.min
## [1] 0.02166219

Here, we can use the combination of “glmnet” and “train” function equippted by caret package. Based on the lambda range we obtained above, we can set the tunGrid as “0 to 0.05 while each step is 0.0001”.

time_logistic <- Sys.time()
model_logistic <- train(Disease.Type ~ .,
                  data = sample_train,
                  method = "glmnet",
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                  tuneGrid = expand.grid(alpha=0:1, lambda=seq(0, 0.05, by = 0.0001)))
time_logistic <- Sys.time() - time_logistic
save(model_logistic, file = "model_logistic.RData")
save(time_logistic, file = "time_logistic.RData")

Next, we can visualize the output.

plot(model_logistic)

resampleHist(model_logistic)

ggplot(varImp(model_logistic), main = "Variable Importance with Logistic Regression")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that setting alpha value as “1” and setting lambda as “0.001” is most beneficial for the RF model in this dataset.

model_logistic$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##    alpha lambda  Accuracy     Kappa AccuracySD    KappaSD
## 1      1  1e-03 0.7290967 0.4584149 0.01272284 0.02539744
## 2      1  0e+00 0.7290702 0.4583621 0.01271626 0.02538437
## 3      1  1e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 4      1  2e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 5      1  3e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 6      1  4e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 7      1  5e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 8      1  6e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 9      1  7e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 10     1  8e-04 0.7290702 0.4583621 0.01271626 0.02538437
## 11     1  9e-04 0.7290702 0.4583621 0.01271626 0.02538437

Storing the Hyperparameters.

parameter_LR <- c("Alpha = 1; Lambda = 0.001")

Performance Display
Display the model performance. The accuracy rate is 0.724. The specificity (which is 0.6718) is satisfactory. The value of sensitivity is 0.7766 which is relatievly high compared with specificity. This indicats that Lasso Logistic Regression has a better performance in detecting potential patient.

predict_logistic <- predict(model_logistic, newdata = sample_test[,-c(14)], type = "prob")
Matrix_logistic <- confusionMatrix(as.factor(ifelse(predict_logistic$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_logistic
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2089  889
##          1  601 1820
##                                           
##                Accuracy : 0.724           
##                  95% CI : (0.7119, 0.7359)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4482          
##                                           
##  Mcnemar's Test P-Value : 1.045e-13       
##                                           
##             Sensitivity : 0.7766          
##             Specificity : 0.6718          
##          Pos Pred Value : 0.7015          
##          Neg Pred Value : 0.7518          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3869          
##    Detection Prevalence : 0.5516          
##       Balanced Accuracy : 0.7242          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_logistic$`1`,levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.792

Storing the performance index into Per_names data frame.

Per_names <- as.data.frame(c("Accuracy", "Sensitivity", "Specificity", "Precision", "Recall", "ROC", "Time"))
rownames(Per_names) <- c("Accuracy", "Sensitivity", "Specificity", "Precision", "Recall", "ROC", "Time")
colnames(Per_names) <- "LR"
Per_names$LR <- c(as.numeric(Matrix_logistic$overall[1]), as.numeric(Matrix_logistic$byClass[1]), as.numeric(Matrix_logistic$byClass[2]), as.numeric(Matrix_logistic$byClass[5]), as.numeric(Matrix_logistic$byClass[6]), as.numeric(roc$auc), time_logistic/60)

Utilizing the NBC model to predict the classification result of test data. Storing these results to the result data frame.

result <- data.frame()[1:dim(test)[1],]
result$ID <- 1:dim(test)[1]
result$LR <- predict(model_logistic, newdata = scale_test, type = "raw")
rownames(result) <- 1:dim(test)[1]

Artificial Neural Networks / Deep Learning Models

Artificial Neural Networks are best when the data is one-hot encoded, scaled and centered. In addition, other transformations may be beneficial as well to make relationships easier for the algorithm to identify. In this case, we have already dummified, centered and scaled the dataset.

Firstly, initialize a sequential model: The first step is to initialize a sequential model with keras_model_sequential(), which is the beginning of our Keras model. The sequential model is composed of a linear stack of layers. And then, we can add layers to the sequential model, which consists of input layer, hidden layer(s), and output layer.
* Hidden Layers: form the neural network nodes that enable non-linear activation using weights. The hidden layers are created using layer_dense().
The common arguments used here:
1. units: a positive integer representing the dimension of the output space of the layer;
2. activation: the name of the activation function to use. Available options are elu, softmax, selu, softplus, relu, sigmoid, tanh, lieanr, etc;
3. bias_initializer: specifies the initializer for the bias vector. Available options are Zeros, Ones, Constant, identity, etc;
4. weights: specifies the initial weights for layer;
5. input_shape: specifies the dimentionality of the input in the first layer of the model.
* Dropout Layers: are used to control overfitting. This eliminates weights below a cutoff threshold to prevent low weights from overfitting the layers.
* Output Layer: specifies the shape of the output and the method of assimilating the learned information. The output layer is applied using the layer_dense(). .

Hence, the whole process is:
1. Defining the model: providing the number of layers and neurons for each layer; and adding any regularization technique to avoid overfitting
2. Compilng the model: defining the optimizer; providing the loss function; and identifying the metrics
The common arguments used here:
optimizer: it specifies the optimization technique (e.g., SGD, Adam, etc.) used for weights and biases updating.
loss: it specifies the name of objective function or objective function (e.g., binary_crossentropy, categorical_crossentropy, etc.).
metrics: it specifies the list of metrics to be evaluated by the model during training and testing (e.g., binary_accuracy, etc.). It is also worth mentioning that more than one metrics can be used at the same time when training and testing a model.
3. Fitting the model: introducing the number of batches; the number of epochs; and the validation split
Full list of arguments for fit() function are:
batch_size: The number of samples per gradient updates. The default is 32.
epochs: The number of epochs to train the model;
callbacks: List of callbacks to be called during training;
validation_split: Float between 0 and 1. The model sets apart the last fraction of the x and y data provided and use it as a validation set.
validation_data: Data provided to evaluate the model metrics and loss at the end of each epoch. If provided, it will override validation_split.
4 .Evaluating the model: evaluating the model on the test data set; and demonstrating the plots.
5. Predicting the classes/probabilities for the test data set.

Now, let’s us adjust the datasets to fit the ANN model. We use the recipe() function to implement our preprocessing steps (includes centered and scaled data).

rec_obj <- recipe(Disease.Type ~ ., data = sample_train) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>%
  prep(data = sample_train)

Here, we can use bake() function in the recipt package to any data set, and it processes the data following our recipe steps. We’ll apply to oursample_train and sample_test to convert from raw data to a machine learning dataset. Then, we need to store the target values as y_train and y_test, which are needed for modeling our ANN.

x_train <- bake(rec_obj, new_data = sample_train[,c(1:12)])
y_train <- ifelse(pull(sample_train, Disease.Type) == 1, 1, 0)

x_test <- bake(rec_obj, new_data = sample_test[,c(1:12)])
y_test <- ifelse(pull(sample_test, Disease.Type) == 1, 1, 0)


Perceptron
Here, let’s discuss the choices of hyperparameters in single layer perceptron.
For the output layer: Becasue we only have one output layer here, we need to set the unit as ‘1’ since this is a binary classification project. For binary values, the shape should be units = 1. We set the kernel_initializer = “uniform” and the activation = “sigmoid” (common for binary classification). The input_shape should be 12, which is the number of variables inside the dataset. Besides, we can also add a kernel_regularizer to regiularize the output in case of overfitting.
For compile process: The optimizer will be “adam” which learns the fastest (compared with SGD) and generally has a relatively wide range of successful learning rates. As for the loss function, we might use ‘binary_crossentropy’ which is the default loss function to use for binary classification problems.
For fit process: In the single layer perceptron, it seems that a small value (such as ‘20’) will generate a relatively higher accuracy rate compared with a large value (such as ‘1000’). In general, batch size of 32 is a good starting point, and you should also try with 64, 128, and 256. The epochs will be set as 30. Besides, we can also use callback_early_stopping() function to monitor a certain value and urge the iteration to stop if that value does not change anymore. For example, callback_early_stopping(monitor=‘loss’, patience=5) will monitor the ‘loss’ value. But in this project, we will not use this function since we want to thoroughly explore the data and conduct the experiments.

time_model_ann0 <- Sys.time()
model_ann0 <- keras_model_sequential()
model_ann0 %>%
  layer_dense(units = 1, kernel_initializer = "uniform", activation = "sigmoid", input_shape = 12, kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  compile(optimizer = "adam", loss = "binary_crossentropy", metrics = c("acc"))

history <- fit(object = model_ann0, x = as.matrix(x_train), y = y_train, batch_size = 256, epochs = 30, validation_split = 0.3)
time_model_ann0 <- Sys.time() - time_model_ann0

Storing the Hyperparameters.

parameter_ANN0 <- c("batch_size = 256, epochs = 30, validation_split = 0.3")

Then, we can visualize the model.

plot(history)

After geting the model, we can use the model to fit x_test dataset (which is the sample_test) and check the predicted result.

yhat_keras_class <- predict_classes(object = model_ann0, x = as.matrix(x_test)) %>% as.vector()
yhat_keras_prob  <- predict_proba(object = model_ann0, x = as.matrix(x_test)) %>% as.vector()

estimates_keras <- tibble(
  truth      = as.factor(y_test) %>% fct_recode(yes = "1", no = "0"),
  estimate   = as.factor(yhat_keras_class) %>% fct_recode(yes = "1", no = "0"),
  class_prob = yhat_keras_prob
)

Here is the confusion matrix of ANN0.

options(yardstick.event_first = FALSE)
estimates_keras %>% conf_mat(truth, estimate)
##           Truth
## Prediction   no  yes
##        no  2035  857
##        yes  655 1852

To better save the model evaluation result to the Per_names dataframe, we can create a function to warp the processes. And then add the result vector to the dataframe.

getResult <- function(estimates_keras, time){
  Acc <- estimates_keras %>% metrics(truth, estimate)
  Specificity <-estimates_keras %>% specificity(truth, estimate)
  Sensitivity <-estimates_keras %>% sensitivity(truth, estimate)
  Precision <- estimates_keras %>% precision(truth, estimate)
  Recall <-estimates_keras %>% recall(truth, estimate)
  Auc <- estimates_keras %>% roc_auc(truth, class_prob)
  Result <- c(as.numeric(Acc$.estimate[1]), as.numeric(Sensitivity$.estimate), as.numeric(Specificity$.estimate), as.numeric(Precision$.estimate),
                     as.numeric(Recall$.estimate), as.numeric(Auc$.estimate), as.numeric(time))
  return(Result)
}
Modelresult <- getResult(estimates_keras, time_model_ann0/60)
Per_names$ANN0 <- Modelresult

Utilizing the NBC model to predict the classification result of test data. Storing these results to the result data frame.

result$ANN0 <- predict_classes(object = model_ann0, x = as.matrix(scale_test)) %>% as.vector()


MLP - One Layer
For the first layer: The setting of kernel_initializer, kernel_regularizer, and input_shape are the same as we set in ANN0. Here, I set the units as 12 since there are 12 variables inside the dataset. In this way, the number of output of the first layer will be 12 which is correpsonding to that of input_shape. For the activition method, ‘relu’ has been chosen Then, we want the dropout layer to remove weights below 5%. Then, process like output layer will be conducted. The last process if compile the result, which is similar with ANN0.
For dropout layer: We use the layer_dropout() function add two drop out layers with rate = 0.05 to remove weights below 5%.

time_model_ann1 <- Sys.time()
model_ann1 <- keras_model_sequential()
model_ann1 %>%
  layer_dense(units = 12, kernel_initializer = "uniform", activation = "relu", input_shape = 12, kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = 0.05) %>% 
  layer_dense(units = 1, kernel_initializer = "uniform", activation = "sigmoid") %>% 
  compile(optimizer = "adam", loss = "binary_crossentropy", metrics = c("acc"))

Using model_ann1 to fit the sample_train dataset (without target variable).

history <- fit(object = model_ann1, x = as.matrix(x_train), y = y_train, batch_size = 256, epochs = 30, validation_split = 0.3)
time_model_ann1 <- Sys.time() - time_model_ann1

Storing the Hyperparameters.

parameter_ANN1 <- c("units = 12, kernel_initializer = uniform, activation = relu, input_shape = 12, regularizer_l2(l = 0.001), layer_dropout = 0.05, batch_size = 256, epochs = 30, validation_split = 0.3")

Then, we can visualize the model. The pattern of accuracy line is similar with that of accuracy line we ploted in ANN0.

plot(history)


Again, using the model to fit x_test dataset and get the predicted result.

yhat_keras_class <- predict_classes(object = model_ann1, x = as.matrix(x_test)) %>% as.vector()
yhat_keras_prob  <- predict_proba(object = model_ann1, x = as.matrix(x_test)) %>% as.vector()

estimates_keras <- tibble(
  truth      = as.factor(y_test) %>% fct_recode(yes = "1", no = "0"),
  estimate   = as.factor(yhat_keras_class) %>% fct_recode(yes = "1", no = "0"),
  class_prob = yhat_keras_prob
)

Here is the confusion matrix of ANN1. For some reseaons, the reuslt is similar as we obtained from the ANN0’s confusion matrix. The Negative Flase decreases slighlt while the Negative True increases slighlt. We can assumed that the first layer did not bring benefits to the model pridiction performance.

estimates_keras %>% conf_mat(truth, estimate)
##           Truth
## Prediction   no  yes
##        no  2028  818
##        yes  662 1891

To better save the model evaluation result to the Per_names dataframe, we can create a function to warp the processes. And then add the result vector to the dataframe.

Modelresult <- getResult(estimates_keras, time_model_ann1/60)
Per_names$ANN1 <- Modelresult

Utilizing the NBC model to predict the classification result of test data. Storing these results to the result data frame.

result$ANN1 <- predict_classes(object = model_ann1, x = as.matrix(scale_test)) %>% as.vector()


MLP - Two Layers
For the first layer: The setting of kernel_initializer, kernel_regularizer, and input_shape are the same as we set in ANN1.
For the second layer: The setting are also the same except input_shape and rergularizer has been removed while the second dropout rate has been reduced to 0.01.
Using model_ann2 to fit the sample_train dataset (without target variable).

time_model_ann2 <- Sys.time()
model_ann2 <- keras_model_sequential()
model_ann2 %>%
  layer_dense(units = 12, kernel_initializer = "uniform", activation = "relu", input_shape = 12, kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = 0.05) %>%         
  layer_dense(units = 12, kernel_initializer = "uniform", activation = "relu") %>%
  layer_dropout(rate = 0.01) %>% 
  layer_dense(units = 1, kernel_initializer = "uniform", activation = "sigmoid") %>% 
  compile(optimizer = "adam", loss = "binary_crossentropy", metrics = c("acc"))
history <- fit(object = model_ann2, x = as.matrix(x_train), y = y_train, batch_size = 256, epochs = 30, validation_split = 0.3)
time_model_ann2 <- Sys.time() - time_model_ann2

Storing the Hyperparameters.

parameter_ANN2 <- c("layer1: units = 12, kernel_initializer = uniform, activation = relu, input_shape = 12, regularizer_l2(l = 0.001); layer_dropout1 = 0.05; Layer2: units = 12, kernel_initializer = uniform, activation = relu; layer_dropout2 = 0.01, batch_size = 256, epochs = 30, validation_split = 0.3")

Then, we can visualize the model.

plot(history)


Again, using the model to fit x_test dataset and get the predicted result.

yhat_keras_class <- predict_classes(object = model_ann2, x = as.matrix(x_test)) %>% as.vector()
yhat_keras_prob  <- predict_proba(object = model_ann2, x = as.matrix(x_test)) %>% as.vector()

estimates_keras <- tibble(
  truth      = as.factor(y_test) %>% fct_recode(yes = "1", no = "0"),
  estimate   = as.factor(yhat_keras_class) %>% fct_recode(yes = "1", no = "0"),
  class_prob = yhat_keras_prob
)

Here is the confusion matrix of ANN1. For some reseaons, the reuslt is similar as we obtained from the ANN0’s confusion matrix. The Negative Flase decreases slighlt while the Negative True increases slighlt. We can assumed that the first layer did not bring benefits to the model pridiction performance.

estimates_keras %>% conf_mat(truth, estimate)
##           Truth
## Prediction   no  yes
##        no  2025  789
##        yes  665 1920

To better save the model evaluation result to the Per_names dataframe, we can create a function to warp the processes. And then add the result vector to the dataframe.

Modelresult <- getResult(estimates_keras, time_model_ann2/60)
Per_names$ANN2 <- Modelresult

Utilizing the NBC model to predict the classification result of test data. Storing these results to the result data frame.

result$ANN2 <- predict_classes(object = model_ann2, x = as.matrix(scale_test)) %>% as.vector()

Naive Bayes Classifer

Model Tunning
We can tune the few hyperparameters that a naïve Bayes model has:
* usekernel: allows us to use a kernel density estimate for continuous variables versus a guassian density estimate.
* adjust: allows us to adjust the bandwidth of the kernel density (larger numbers mean more flexible density estimate).
* fL: allows us to incorporate the Laplace smoother.
It is worthwhile to point out that we use the preProc function to normalize with Box Cox and reducing with PCA.

time_NBC <- Sys.time()
model_nb <- train(Disease.Type ~.,
                  data = sample_train,
                  method = 'nb',
                  preProc = c("BoxCox", "pca"),
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                  tuneGrid = expand.grid(fL = 0:5, usekernel = c(TRUE, FALSE), adjust = seq(0, 5, by=1)))
time_NBC <- Sys.time() - time_NBC
save(model_nb, file = "model_nbc.RData")
save(time_NBC, file = "time_NBC.RData")

Visualize the model.

plot(model_nb)

resampleHist(model_nb)

ggplot(varImp(model_nb), main = "Variable Importance with NBC")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that the combination of setting usekernal as “TRUE” (density estimation is being used rather then normal density), setting adjust as ‘1’, and setting fL as “0” (no laplace smotthing effect) is most beneficial for NBC model in this dataset.

model_nb$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##   fL usekernel adjust  Accuracy     Kappa AccuracySD    KappaSD
## 1  0      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687
## 2  1      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687
## 3  2      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687
## 4  3      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687
## 5  4      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687
## 6  5      TRUE      1 0.7040974 0.4082755 0.01188002 0.02375687

Storing the Hyperparameters.

parameter_NBC <- c("userKernel = 'True', Ajust = 1; fL = 0")

Performance Display
Display the model performance. The accuracy rate is 0.7033 which fine but is not very satisfactory. The accuracy rate is within the confidence Interval, implying that the result is trustful. Besides, the sensitivity (0.7178) is higher than specificity (which is 0.6888).

predict_nb <- predict(model_nb, newdata = sample_test[,-c(13)], type = "prob")
Matrix_nb <- confusionMatrix(as.factor(ifelse(predict_nb$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1931  843
##          1  759 1866
##                                           
##                Accuracy : 0.7033          
##                  95% CI : (0.6909, 0.7154)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.4066          
##                                           
##  Mcnemar's Test P-Value : 0.03811         
##                                           
##             Sensitivity : 0.7178          
##             Specificity : 0.6888          
##          Pos Pred Value : 0.6961          
##          Neg Pred Value : 0.7109          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3577          
##    Detection Prevalence : 0.5138          
##       Balanced Accuracy : 0.7033          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_nb$`1`,levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.7615

Storing the performance index into Per_names data frame.

Per_names$NBC <- c(as.numeric(Matrix_nb$overall[1]), as.numeric(Matrix_nb$byClass[1]), as.numeric(Matrix_nb$byClass[2]), as.numeric(Matrix_nb$byClass[5]), as.numeric(Matrix_nb$byClass[6]), as.numeric(roc$auc), as.numeric(time_NBC))

K Nearest Neighbor

Model Tunning
We can tune the few hyperparameters that a knn model has:
* k: Number of neighbors (K). A small value for K provides the most flexible fit, which will have low bias but high variance. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.

time_KNN <- Sys.time()
model_KNN <- train(Disease.Type ~ .,
                   data = sample_train,
                   method = "knn",
                   trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                   tuneGrid = data.frame(k = seq(20:80)))
time_KNN <- Sys.time() - time_KNN
save(model_KNN, file = "model_KNN.RData")
save(time_KNN, file = "time_KNN.RData")

Visualize the model.

plot(model_KNN)

resampleHist(model_KNN)

ggplot(varImp(model_KNN), main = "Variable Importance with KNN")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that setting K value as “53” is most beneficial for the KNN model in this dataset. The Accuracy plot shows a general trends in the increase in performance with K value and that the larger value of k is probably preferred (range between 20 to 80).

model_KNN$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##    k  Accuracy     Kappa AccuracySD    KappaSD
## 1 53 0.7221672 0.4445448 0.01486998 0.02973428
## 2 54 0.7220347 0.4442769 0.01505249 0.03009922
## 3 36 0.7218753 0.4439252 0.01403836 0.02807602
## 4 49 0.7218234 0.4438460 0.01559124 0.03117659
## 5 55 0.7218232 0.4438559 0.01554141 0.03107740

Storing the Hyperparameters.

parameter_KNN <- c("K = 53")

Performance Display
Display the model performance. The accuracy rate is 0.7229 which is slightly higher than that of NBC model. The specificity (which is 0.6781) is lower than the value of NBC model. Compared with the sensitivity value of NBC model, the value of KNN (0.7680) is significantly higher, indicating that KNN has a better performance in detecting every possible patient.

predict_KNN <- predict(model_KNN, newdata = sample_test[,-c(13)], type = "prob")
Matrix_KNN <- confusionMatrix(as.factor(ifelse(predict_KNN$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_KNN
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2066  872
##          1  624 1837
##                                           
##                Accuracy : 0.7229          
##                  95% CI : (0.7108, 0.7348)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.446           
##                                           
##  Mcnemar's Test P-Value : 1.702e-10       
##                                           
##             Sensitivity : 0.7680          
##             Specificity : 0.6781          
##          Pos Pred Value : 0.7032          
##          Neg Pred Value : 0.7464          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3827          
##    Detection Prevalence : 0.5442          
##       Balanced Accuracy : 0.7231          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_KNN$`1`, levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.7845

Storing the performance index into Per_names data frame.

Per_names$KNN <- c(as.numeric(Matrix_KNN$overall[1]), as.numeric(Matrix_KNN$byClass[1]), as.numeric(Matrix_KNN$byClass[2]), as.numeric(Matrix_KNN$byClass[5]), as.numeric(Matrix_KNN$byClass[6]), as.numeric(roc$auc), as.numeric(time_KNN))

Random Forest

Model Tunning
We can tune the few hyperparameters that a RF model has:
* ntree: the number of trees to grow. Larger the tree, it will be more computationally expensive to build models.
* mtry: how many variables we should select at a node split. The default value is p/3 for regression and sqrt(p) for classification. We should always try to avoid using smaller values of mtry to avoid overfitting.
* nodesize: how many observations we want in the terminal nodes. This parameter is directly related to tree depth. Higher the number, lower the tree depth. With lower tree depth, the tree might even fail to recognize useful signals from the data.

time_RF <- Sys.time()
model_RF <- train(Disease.Type ~ .,
                  data = sample_train,
                  method = "rf",
                  ntree = 1500,
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                  tuneGrid = expand.grid(mtry = 1:5))
time_RF <- Sys.time() - time_RF
save(model_RF, file = "model_RF.RData")
save(time_RF, file = "time_RF.RData")

Visualize the model.

plot(model_RF)

resampleHist(model_RF)

ggplot(varImp(model_RF), main = "Variable Importance with RF")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that setting mtry value as “2” and setting ntree as “1500” is most beneficial for the RF model in this dataset.

model_RF$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    2 0.7274294 0.4551343 0.01435929 0.02869234
## 2    3 0.7215046 0.4432052 0.01320610 0.02638626
## 3    1 0.7165578 0.4334861 0.01475311 0.02946293
## 4    4 0.7116108 0.4233533 0.01362887 0.02723509
## 5    5 0.7043620 0.4087964 0.01273796 0.02545799

Storing the Hyperparameters.

parameter_RF <- c("mtry = 2 and;ntree = 1500")

Performance Display
After tuning, we have achieved an overall accuracy of 0.7268, which is the highest up to now. The sensitivity is 0.7933, which is also the highest among the given three models. The specificity (0.6608), however, is lowest among those models.

predict_RF <- predict(model_RF, newdata = sample_test[,-c(13)], type = "prob")
Matrix_RF <- confusionMatrix(as.factor(ifelse(predict_RF$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_RF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2134  919
##          1  556 1790
##                                           
##                Accuracy : 0.7268          
##                  95% CI : (0.7147, 0.7387)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4539          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7933          
##             Specificity : 0.6608          
##          Pos Pred Value : 0.6990          
##          Neg Pred Value : 0.7630          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3953          
##    Detection Prevalence : 0.5655          
##       Balanced Accuracy : 0.7270          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_RF$`1`, levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.7846

Storing the performance index into Per_names data frame.

Per_names$RF <- c(as.numeric(Matrix_RF$overall[1]), as.numeric(Matrix_RF$byClass[1]), as.numeric(Matrix_RF$byClass[2]), as.numeric(Matrix_RF$byClass[5]), as.numeric(Matrix_RF$byClass[6]), as.numeric(roc$auc), as.numeric(time_RF))

Gradient Boosting Machine

Model Tunning
We can tune the few hyperparameters that a XGBoost model has:
* nrounds: The max number of iterations.
* eta: Step size of each boosting step. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
* max_depth: Maximum depth of the tree. Increasing this value will make the model more complex and more likely to overfit.
* gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
* min_child_weight: Minimum sum of instance weight (hessian) needed in a child. The larger min_child_weight is, the more conservative the algorithm will be.
* subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. and this will prevent overfitting. Subsampling will occur once in every boosting iteration.
* etc.

time_XGBoost <- Sys.time()
model_XGB <- train(Disease.Type ~ .,
                   data = sample_train,
                   method = "xgbTree",
                   trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                   tuneGrid = expand.grid(nrounds = c(25,50,100,150,200,250,300),
                                          max_depth = c(2,3,5,7,9),
                                          colsample_bytree = seq(0.1, 1, length.out = 5),
                                          eta = 0.1,
                                          gamma=0,
                                          min_child_weight = 1,
                                          subsample = 1))
time_XGBoost <- Sys.time() - time_XGBoost
save(model_XGB, file = "model_XGB.RData")
save(time_XGBoost, file = "time_XGBoost.RData")

Visualize the model.

plot(model_XGB)

resampleHist(model_XGB)

ggplot(varImp(model_XGB), main = "Variable Importance with XGB")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that the combination of setting nrounds = “100”, max_depth = “2”, eta = “0.1”, gamma = “0”, colsample_bytree = “0.550”, min_child_weight = “1”, and subsample = “1” is most beneficial for the XGBoost model in this dataset.

model_XGB$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##   eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
## 1 0.1         2     0            0.550                1         1     200
## 2 0.1         2     0            1.000                1         1     150
## 3 0.1         2     0            0.775                1         1     150
## 4 0.1         2     0            0.775                1         1     100
## 5 0.1         2     0            0.550                1         1     300
##    Accuracy     Kappa AccuracySD    KappaSD
## 1 0.7299950 0.4601860 0.01237504 0.02474061
## 2 0.7298100 0.4597993 0.01354413 0.02707640
## 3 0.7297307 0.4596559 0.01293357 0.02586130
## 4 0.7296778 0.4595701 0.01366865 0.02733184
## 5 0.7296511 0.4594872 0.01279452 0.02558219

Storing the Hyperparameters.

parameter_XGB <- c("nrounds = 100; max_depth = 2; eta = 0.1; gamma = 0; colsample_bytree = 0.550; min_child_weight = 1; and subsample = 1")

Performance Display
After tuning, we have achieved an overall accuracy of 0.7292, which is slightly higher than that of RF. The sensitivity is 0.7680, which is the second highest. The specificity (which is 0.6907), is the highest up to now.

predict_XGB <- predict(model_XGB, newdata = sample_test[,-c(13)], type = "prob")
Matrix_XGB <- confusionMatrix(as.factor(ifelse(predict_XGB$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_XGB
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2066  838
##          1  624 1871
##                                          
##                Accuracy : 0.7292         
##                  95% CI : (0.7171, 0.741)
##     No Information Rate : 0.5018         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4586         
##                                          
##  Mcnemar's Test P-Value : 2.538e-08      
##                                          
##             Sensitivity : 0.7680         
##             Specificity : 0.6907         
##          Pos Pred Value : 0.7114         
##          Neg Pred Value : 0.7499         
##              Prevalence : 0.4982         
##          Detection Rate : 0.3827         
##    Detection Prevalence : 0.5379         
##       Balanced Accuracy : 0.7293         
##                                          
##        'Positive' Class : 0              
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_XGB$`1`, levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.7975

Storing the performance index into Per_names data frame.

Per_names$XGB <- c(as.numeric(Matrix_XGB$overall[1]), as.numeric(Matrix_XGB$byClass[1]), as.numeric(Matrix_XGB$byClass[2]), as.numeric(Matrix_XGB$byClass[5]), as.numeric(Matrix_XGB$byClass[6]), as.numeric(roc$auc), as.numeric(time_XGBoost))

Linear Support Vector Machine

Model Tunning
We can tune the few hyperparameters that a RF model has:
* C: The Penalty parameter C of the error term. A higher C might cause lower bias but risk of overfitting; A lower C might cause higher bias but lower variance.
* kernel: Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable.
* degree: Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
* gamma: Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used instead.
* class_weight: Set the parameter C of class i to class_weight[i]C for SVC. If not given, all classes are supposed to have weight one.
* etc.

time_SVM_Linear <- Sys.time()
model_svm_linear <- train(Disease.Type ~ .,
                          data = sample_train,
                          method = "svmLinear",
                          trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                          tuneGrid = expand.grid(C = seq(0.05, 1, 0.05)))
time_SVM_Linear <- Sys.time() - time_SVM_Linear
save(model_svm_linear, file = "model_svm_linear.RData")
save(time_SVM_Linear, file = "time_SVM_Linear.RData")

Visualize the model.

plot(model_svm_linear)

resampleHist(model_svm_linear)

ggplot(varImp(model_svm_linear), main = "Variable Importance with Linear SVM")


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that setting C as “0.05” is most beneficial for the XGBoost model in this dataset. According to the Accuracy Plot, the Accuracy decrease with the increase of the cost value.

model_svm_linear$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##      C  Accuracy     Kappa AccuracySD    KappaSD
## 1 0.05 0.7265563 0.4534426 0.01064391 0.02125527
## 2 0.10 0.7265034 0.4533367 0.01077677 0.02152037
## 3 0.20 0.7262919 0.4529143 0.01083971 0.02164551
## 4 0.15 0.7262654 0.4528605 0.01080065 0.02156808
## 5 0.50 0.7262390 0.4528083 0.01086612 0.02169839

Storing the Hyperparameters.

parameter_svm_linear <- c("C = 0.05")

Performance Display
After tuning, we have achieved an overall accuracy of 0.7222. The sensitivity is 0.8078, which is the highest up to now. The specificity (which is 0.6371), however, is the lowest among those models.

predict_svm_linear <- predict(model_svm_linear, newdata = sample_test[,-c(13)], type = "raw")
Matrix_svm_linear <- confusionMatrix(as.factor(predict_svm_linear), as.factor(sample_test$Disease.Type))
Matrix_svm_linear
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2173  983
##          1  517 1726
##                                         
##                Accuracy : 0.7222        
##                  95% CI : (0.71, 0.7341)
##     No Information Rate : 0.5018        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4447        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.8078        
##             Specificity : 0.6371        
##          Pos Pred Value : 0.6885        
##          Neg Pred Value : 0.7695        
##              Prevalence : 0.4982        
##          Detection Rate : 0.4025        
##    Detection Prevalence : 0.5846        
##       Balanced Accuracy : 0.7225        
##                                         
##        'Positive' Class : 0             
## 

ROC line
Normally, one cannot generate a plotting Receiver Operating Characteristics (ROC) curve for SVM Model.
Storing the performance index into Per_names data frame.

Per_names$Linear_SVM <- c(as.numeric(Matrix_svm_linear$overall[1]), as.numeric(Matrix_svm_linear$byClass[1]), as.numeric(Matrix_svm_linear$byClass[2]), as.numeric(Matrix_svm_linear$byClass[5]), as.numeric(Matrix_svm_linear$byClass[6]), "Nan", as.numeric(time_SVM_Linear))

Support Vector Machine with Non-linear Kernel: RBF

Here, for this SVM model with RBF Kernel, we only consider hyper-parameter “C” and “Sigma” (which is also called gamma). The default gamma value should be 1/k (k is the number of features inside the dataset). This hyper-parameter decides the distribution of data once the data have been projected. Generally, a higher gamma might cause a smaller number of support vector generated by the model, which directly influence the computational speed.

time_SVM_RBF <- Sys.time()
model_svm_rbf <- train(Disease.Type ~ .,
                       data = sample_train,
                       method = "svmRadial",
                       tuneGrid = expand.grid(sigma = seq(0, 0.5, 0.1), C = seq(0.05, 0.5, 0.1)),
                       trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))
time_SVM_RBF <- Sys.time() - time_SVM_RBF
save(model_svm_rbf, file = "model_svm_rbf.RData")
save(time_SVM_RBF, file = "time_SVM_RBF.RData")

Visualize the model.

plot(model_svm_rbf)

resampleHist(model_svm_rbf)

ggplot(varImp(model_svm_rbf, main = "Variable Importance with model_svm_rbf"))


Combining the results of top 5 cases with the highest Accuracy rates with the results displayed above, we can conclude that the combination of setting sigma = 0.1 and C = 0.45 is most beneficial for the XGBoost model in this dataset.

model_svm_rbf$results %>% top_n(5, wt = Accuracy) %>% arrange(desc(Accuracy))
##   sigma    C  Accuracy     Kappa AccuracySD    KappaSD
## 1   0.1 0.45 0.7284890 0.4572305 0.01077131 0.02151348
## 2   0.1 0.35 0.7283832 0.4570154 0.01055131 0.02107439
## 3   0.1 0.25 0.7282243 0.4566918 0.01040234 0.02077370
## 4   0.1 0.15 0.7281976 0.4566206 0.01018398 0.02034050
## 5   0.2 0.45 0.7269548 0.4541324 0.01164153 0.02326076

Storing the Hyperparameters.

parameter_svm_rbf <- c("C = 0.45 and sigma = 0.1")

Performance Display
After tuning, we have achieved an overall accuracy of 0.7277, which is a quit normal performance compared with others’. The sensitivity is 0.7836, which is slightly higher than KNN’s. The specificity is 0.6722, which is the highest among those 6 models.

predict_svm_rbf <- predict(model_svm_rbf, newdata = sample_test[,-c(13)], type = "raw")
Matrix_svm_rbf <- confusionMatrix(as.factor(predict_svm_rbf), as.factor(sample_test$Disease.Type))
Matrix_svm_rbf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2108  888
##          1  582 1821
##                                           
##                Accuracy : 0.7277          
##                  95% CI : (0.7156, 0.7396)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4557          
##                                           
##  Mcnemar's Test P-Value : 1.791e-15       
##                                           
##             Sensitivity : 0.7836          
##             Specificity : 0.6722          
##          Pos Pred Value : 0.7036          
##          Neg Pred Value : 0.7578          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3904          
##    Detection Prevalence : 0.5549          
##       Balanced Accuracy : 0.7279          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Normally, one cannot generate a plotting Receiver Operating Characteristics (ROC) curve for SVM Model.
Storing the performance index into Per_names data frame.

Per_names$SVM_RBF <- c(as.numeric(Matrix_svm_rbf$overall[1]), as.numeric(Matrix_svm_rbf$byClass[1]), as.numeric(Matrix_svm_rbf$byClass[2]), as.numeric(Matrix_svm_rbf$byClass[5]), as.numeric(Matrix_svm_rbf$byClass[6]), "Nan", as.numeric(time_SVM_RBF)*60)

Decision Tree

Since we want the decision tree to show the threshold, it would be better to use the original data instead of data with standardization (scale and center). Using the existing train_index to split the dummy_sample dataset (the 18,000 sampled data). Now, we will get a train_DT dataset and a test_DT dataset. The records inside the two datasets are exactly the same as data inside the sample_train and sample_test dataset.

train_DT <- dummies_sample[train_index,]
test_DT <- dummies_sample[-train_index,]

Now, we can use the train_DT and test_DT to run the model.

time_DT <- Sys.time()
model_DT <- train(Disease.Type ~., 
                  data = train_DT,
                  metric = "Accuracy", 
                  method = "rpart",
                  tuneGrid = expand.grid(cp=seq(0, 0.03, 0.001)),
                  control = rpart.control(minsplit = 30, minbucket = round(30/3), maxdepth = 6),
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))
time_DT <- Sys.time() - time_DT
save(model_DT, file = "model_DT.RData")
save(time_DT, file = "time_DT.RData")

Visualize the model.

plot(model_DT)

resampleHist(model_DT)

ggplot(varImp(model_DT, main = "Variable Importance with Decision Tree"))

Storing the Hyperparameters.

parameter_dt <- c("cp = 0.001, minsplit = 30, minbucket = 10, maxdepth = 6")

Let’s visualize the conditions.
In this plot, High.Blood.Pressure is the most essential indicator which could be used to predict the disease. Here, we can summarise that people who tends to contract this disease has a higher upper boundary blood pressure (higher than 129). What’s more, the elder (whose age is sobe 52) who has a higher level of cholesterol also tends to get this disease. In this plot, Age and Cholesterol level (specially the high cholesterol level) are the second and thrid important indicators. The next significant indicators are Glocose level and lower boundary bood pressure (Low.blood.pressure). Even those who are younger than 52, if they has a higer level Cholesterol and a higher Cholesterol level of Glucose level, they are still inside the dangerous group.

fancyRpartPlot(model_DT$finalModel)

Here, we can use the test_DT to fit the model and get the Confusion Matrix. The accuracy rate is 0.7264, which seems acceptable. The value of sensitivity is 0.7900, which is good. The specificty (0.6633), however, is relatively lower than previous models.

predict_dt <- predict(model_DT, newdata = test_DT[,-c(13)], type = "prob")
Matrix_dt <- confusionMatrix(as.factor(ifelse(predict_dt$`1`> 0.5, 1, 0)), as.factor(sample_test$Disease.Type))
Matrix_dt
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2125  912
##          1  565 1797
##                                           
##                Accuracy : 0.7264          
##                  95% CI : (0.7143, 0.7383)
##     No Information Rate : 0.5018          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4531          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7900          
##             Specificity : 0.6633          
##          Pos Pred Value : 0.6997          
##          Neg Pred Value : 0.7608          
##              Prevalence : 0.4982          
##          Detection Rate : 0.3936          
##    Detection Prevalence : 0.5625          
##       Balanced Accuracy : 0.7267          
##                                           
##        'Positive' Class : 0               
## 

ROC line
Plotting Receiver Operating Characteristics (ROC) curve and displaying the value of Area Under the Curve (AUC).

roc <- roc(sample_test$Disease.Type, predict_dt$`1`, levels = c(0, 1), direction = "<")
plot(roc)

auc(roc)
## Area under the curve: 0.7827

Storing the performance index into Per_names data frame.

Per_names$DT <- c(as.numeric(Matrix_dt$overall[1]), as.numeric(Matrix_dt$byClass[1]), as.numeric(Matrix_dt$byClass[2]), as.numeric(Matrix_dt$byClass[5]), as.numeric(Matrix_dt$byClass[6]), as.numeric(roc$auc), as.numeric(time_DT/60))

Utilizing the NBC model to predict the classification result of test data. Storing these results to the result data frame.

result$DT <- predict(model_DT, newdata = New_Test, type = "raw")


Then use write.csv function to save the document to local directory.

write.csv(result, "xwanyue_hw04_result.csv")

Model Performance Evaluation


One should run the models several times with different hyper-parameters in order to optimize the results. To evaluate the performance of these models, we can use Model Performance Metrics to calculate the Accuracy, Precision (percent of predicted positive that are correct), Recall (percent of positive cases that are correctly predicted), and ROC.

Model Performance Validation

For the Model Performance Validation, there are mainly two types, which are Cross Validation and Bootstrapping.
* Cross Validation splits the available dataset to create multiple datasets.
* Bootstrapping method uses the original dataset to create multiple datasets after resampling with replacement.
* However, Bootstrapping it is not as strong as Cross validation when it is used for model validation since that bootstrapping is more about building ensemble models or just estimating parameters.
In others words, Cross-validation evaluates how well an algorithm generalizes, whereas bootstrapping actually helps the algorithm to generalize better.
* Hence, we decide to use 10-fold Crossvalidation for model performance validation even though Bootstrapping can reduce the variance of a classification algorithm that is based on a single model.

Model Summary

According to the analysis illustrated above, there are the Summary of these models:
As our problem is a part of classification, accuracy will be mesuared instead of the RMSE. “Accuracy” shows how many times a model predicted the right value/ class among all data set. A table which records all the model performance values has been dispalyed below.

* If we only look at the Accuracy, ANN2 and XGBoost seem to have a better performance after several times of running (both accuracy value are about 0.7293). Then it should be SVM_RBF, RF, and DT who have a similar accuracy value. ANN1 and LR also have a similar accuracy value. Then it should be KNN and Linear_SVM. NBC generally performed less satisfactory compared with other models.

* In this case, sensitivity is important than specificity due to the nature of this classification. Since we do not want to miss any potential patient, a higher sensitivity value might help us to achive this goal. If we look at Sensitivity Value, it is easy to find that SVM_Lieanr perform better than others. The second model is RF while the thrid one is DT. In this case, ANN0 has the lowest value.

* Generally, model with a ROC value that is higher than 0.75 is realiable. If we look at ROC value, it is good to see that the ROC value of all the models is above 0.75.

* If we look at the running time value (mins), one can summarise that ANN perforam faster than the rest models. DT and LR are also one of the most agile models in this plot. All of these models mentioned above controls the running time with 1 min. SVM_RBF, however, is the lowest since it took nearly 101 mins to train a dataset which only contains 12,600 records.

Ranking based on Accuracy:

Per_names[,order(Per_names[1,] ,decreasing = T)]
##                   ANN2          XGB           SVM_RBF         RF        DT
## Accuracy    0.73069087 7.292091e-01 0.727727356917948  0.7268013 0.7264308
## Sensitivity 0.70874862 7.680297e-01  0.78364312267658  0.7933086 0.7899628
## Specificity 0.75278810 6.906608e-01 0.672203765227021  0.6607604 0.6633444
## Precision   0.74274662 7.114325e-01 0.703604806408545  0.6989846 0.6997037
## Recall      0.70874862 7.680297e-01  0.78364312267658  0.7933086 0.7899628
## ROC         0.79919517 7.975037e-01               Nan  0.7845510 0.7826533
## Time        0.04387145 1.587926e+09  100.778596901894 50.7434168 0.1573362
##                   ANN1        LR       KNN        Linear_SVM      ANN0
## Accuracy    0.72587516 0.7240230 0.7229117 0.722170772365253 0.7199481
## Sensitivity 0.69804356 0.7765799 0.7680297 0.807806691449814 0.6836471
## Specificity 0.75390335 0.6718346 0.6781100 0.637135474344777 0.7565056
## Precision   0.74069722 0.7014775 0.7031995 0.688529784537389 0.7387316
## Recall      0.69804356 0.7765799 0.7680297 0.807806691449814 0.6836471
## ROC         0.79608039 0.7920410 0.7845164               Nan 0.7840066
## Time        0.04092058 0.8763096 6.1761439  29.7793678323428 0.1377632
##                    NBC
## Accuracy     0.7032784
## Sensitivity  0.7178439
## Specificity  0.6888151
## Precision    0.6961067
## Recall       0.7178439
## ROC          0.7614886
## Time        27.2396769

Ranking based on Sensitivity:

Per_names[c(2,6),order(Per_names[2,] ,decreasing = T)]
##                    Linear_SVM        RF        DT          SVM_RBF        LR
## Sensitivity 0.807806691449814 0.7933086 0.7899628 0.78364312267658 0.7765799
## ROC                       Nan 0.7845510 0.7826533              Nan 0.7920410
##                   KNN       XGB       NBC      ANN2      ANN1      ANN0
## Sensitivity 0.7680297 0.7680297 0.7178439 0.7087486 0.6980436 0.6836471
## ROC         0.7845164 0.7975037 0.7614886 0.7991952 0.7960804 0.7840066

Ranking based on Running time:

Per_names[c(7),order(Per_names[7,] ,decreasing = T)]
##           KNN       RF       Linear_SVM      NBC        XGB          SVM_RBF
## Time 6.176144 50.74342 29.7793678323428 27.23968 1587926116 100.778596901894
##             LR        DT      ANN0       ANN2       ANN1
## Time 0.8763096 0.1573362 0.1377632 0.04387145 0.04092058


Next, we can use some function to visualize the modelos performance.

model_comparison <- resamples(list(XGBoost = model_XGB, 
                                   LR = model_logistic,
                                   DT = model_DT,
                                   NBC = model_nb,
                                   KNN = model_KNN,
                                   RF = model_RF,
                                   SVM_Linear = model_svm_linear, 
                                   SVM_RBF = model_svm_rbf))
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(model_comparison, scales = scales)



Accoring to the requirement, we can create a data frame called hyper_all to store all the hyper-parameters we used in model tunning.

hyper_all <- as.data.frame(c(parameter_ANN0, parameter_ANN1, parameter_ANN2, parameter_LR, parameter_dt, parameter_NBC, parameter_KNN, parameter_RF, parameter_XGB, parameter_svm_linear, parameter_svm_rbf))
colnames(hyper_all) <- c("Hyper-Parameters")
rownames(hyper_all) <- c("ANN0", "ANN1", "ANN2", "Logistic Regression", "Decision Tree", "Naive Bayese", "KNN", "Random Forest", "XGBoost", "SVM_Linear","SVM_RBF")
hyper_all
##                                                                                                                                                                                                                                                                                               Hyper-Parameters
## ANN0                                                                                                                                                                                                                                                     batch_size = 256, epochs = 30, validation_split = 0.3
## ANN1                                                                                                                     units = 12, kernel_initializer = uniform, activation = relu, input_shape = 12, regularizer_l2(l = 0.001), layer_dropout = 0.05, batch_size = 256, epochs = 30, validation_split = 0.3
## ANN2                layer1: units = 12, kernel_initializer = uniform, activation = relu, input_shape = 12, regularizer_l2(l = 0.001); layer_dropout1 = 0.05; Layer2: units = 12, kernel_initializer = uniform, activation = relu; layer_dropout2 = 0.01, batch_size = 256, epochs = 30, validation_split = 0.3
## Logistic Regression                                                                                                                                                                                                                                                                  Alpha = 1; Lambda = 0.001
## Decision Tree                                                                                                                                                                                                                                          cp = 0.001, minsplit = 30, minbucket = 10, maxdepth = 6
## Naive Bayese                                                                                                                                                                                                                                                            userKernel = 'True', Ajust = 1; fL = 0
## KNN                                                                                                                                                                                                                                                                                                     K = 53
## Random Forest                                                                                                                                                                                                                                                                        mtry = 2 and;ntree = 1500
## XGBoost                                                                                                                                                                                  nrounds = 100; max_depth = 2; eta = 0.1; gamma = 0; colsample_bytree = 0.550; min_child_weight = 1; and subsample = 1
## SVM_Linear                                                                                                                                                                                                                                                                                            C = 0.05
## SVM_RBF                                                                                                                                                                                                                                                                               C = 0.45 and sigma = 0.1

Comparision between ANN0, Logistic Regression, and Linear SVM


First of all, we can display the performance result of those three models. As for Accuracy rate, LR model has the highest value while ANN0 has the lowest. If we focus on specificity value, the value of Liearn_SVM (0.8078) is much higher than that of lr (0.7766), let alone ANN0 (whose sensitivity value is 0.6836). For the ROC value, both LR and ANN0 could be treated as reliable. The value of Linear_SVM is lower than 0.75. For the running time, ANN0 model is the fastest among the three. The next is LR, which also took less than 1 min to train the model. The lowest is Linear_SVM which spend nealy 30 mins.

If we want to judge the three models based on Accuracy rate, it will not make much senses since they has a similar accuracy rate. If someone who want a model with a outstanding specificity rate, Linear_SVM is the best bet. If someone is looking for model that cost little of time, LR and ANN0 are suitable.

If we want to differantiate the three models based on their theory, one can say that Linear SVMs and logistic regression generally perform comparably in practice.
SVM works well with unstructured and semi-structured data like text and images while logistic regression works with already identified independent variables. SVM is based on geometrical properties of the data while logistic regression is based on statistical approaches. The logistic regression comes from generalized linear regression. The Support Vector Machines algorithm is much more geometrically motivated. Generally, if the number of features is small while the number of train records is large, one can choose logistic regression or linear SVM.

For the difference between logistic regression and perceptron, one can say that logistic regression has probabilistic connotations that go beyond the classifier use in ML. The perceptron classification algorithm is a more basic procedure, based on dot products between examples and weights. In some cases, the term perceptron is also used to refer to neural networks which use a logistic function as a transfer function (however, this is not in accordance with the original terminology). In that case, a logistic regression and a “perceptron” are exactly the same.

Per_names[,order(Per_names[1,] ,decreasing = T)][c(7,9,10)]
##                    LR        Linear_SVM      ANN0
## Accuracy    0.7240230 0.722170772365253 0.7199481
## Sensitivity 0.7765799 0.807806691449814 0.6836471
## Specificity 0.6718346 0.637135474344777 0.7565056
## Precision   0.7014775 0.688529784537389 0.7387316
## Recall      0.7765799 0.807806691449814 0.6836471
## ROC         0.7920410               Nan 0.7840066
## Time        0.8763096  29.7793678323428 0.1377632


### Relative Importance of Features Plot From the plots summary, one can see that, except logistic regression, the top 3 important features are the same in the rest models, which are “high.blood.pressure”, “low.blood.pressure, and age”. Generally, the fourth important feature is either BMI or Cholesterol Level.

dt <- ggplot(varImp(model_DT, main = "Variable Importance with Decision Tree"))
svm_rbf <- ggplot(varImp(model_svm_rbf, main = "Variable Importance with model_svm_rbf"))
svm_linear <- ggplot(varImp(model_svm_linear), main = "Variable Importance with Linear SVM")
XGB <- ggplot(varImp(model_XGB), main = "Variable Importance with XGB")
RF <- ggplot(varImp(model_RF), main = "Variable Importance with RF")
KNN <- ggplot(varImp(model_KNN), main = "Variable Importance with KNN")
NBC <- ggplot(varImp(model_nb), main = "Variable Importance with NBC")
LR <- ggplot(varImp(model_logistic), main = "Variable Importance with Logistic Regression")

ggarrange(dt, svm_rbf, svm_linear, XGB, RF, KNN, NBC, LR, labels = c("DT","SVM_RBF","SVM_Lieanr","XGBoost", "RF","KNN","NBC","LR"), ncol = 2, nrow = 4)


Recommendation & Limitation
For this case, one should investigate the relationships between hyperperameter and models. Due to the CPU limits, we only use several basic hyper-perameter for model tuning. Hence, we cannot guarantee that we have found out the optimized results. Besides, we only use 20,000 records of data, which also might cause some bias.

Finally